Introduction

The superheat package produces supervised heatmaps, a data exploration tool for visually exploring complex datasets. The highly customisable image combines a clustered heatmap with scatterplots, with the aim of both visualizing the information contained in the data and assessing the adequacy of models fit to it.


Downloading and installing the package

The goal of this guide is to help you understand how to use the superheat package in R to explore your data and assess your model of interest. First, you need to download and install the package. This can be done using the devtools package, which allows the user to download and install packages hosted on GitHub, such as this one. If you have not yet installed devtools, you will need to do so first. On most systems, type the following code into your R console:

install.packages("devtools")
devtools::install_github("rlbarter/superheat")

Next, you can load the superheat library into your workspace:

library(superheat)
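For scripts that may run on machines where the package is not yet installed, the steps above can be wrapped in a small check. This is only a minimal sketch: ensure_superheat is a hypothetical helper name introduced here, not part of the package, and it installs nothing unless explicitly asked to.

```r
# Hypothetical helper: report whether superheat is available,
# installing it from GitHub only when install = TRUE.
ensure_superheat <- function(install = FALSE) {
  if (requireNamespace("superheat", quietly = TRUE)) return(TRUE)
  if (!install) return(FALSE)
  if (!requireNamespace("devtools", quietly = TRUE)) {
    install.packages("devtools")
  }
  devtools::install_github("rlbarter/superheat")
  requireNamespace("superheat", quietly = TRUE)
}

# TRUE if the package can be loaded, FALSE otherwise
ensure_superheat()
```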

The contents of the plot

As mentioned above, the primary aim of this package is to produce a supervised heatmap that can be used both to explore your data and to diagnose areas of your data where your model may be performing sub-optimally (and to give clues about possible improvements). The plot consists of three elements:

  1. a clustered heatmap: a clustered version of the \((n \times p)\) design matrix \({\bf X}\).

  2. plots above and to the right of the heatmap: these could be scatterplots, barplots, scatterplots with a smoothed curve, isolated smoothed curves or line plots.

  3. cluster or variable labels below and to the left of the heatmap: these labels correspond to the cluster numbers/names or the variable names.


Things you should keep in mind

The heatmap in the center of the plot is of a matrix, \({\bf X}\), in which, by default, the rows are clustered into a number of distinct clusters. The colour scale is presented below the heatmap.


Usage

The package consists of a single function: superheat.

The superheat function takes data objects, the most important of which are X (the heatmap matrix), yr (a vector of values to be plotted to the right of the heatmap) and yt (a vector of values to be plotted above the heatmap). Both yr and yt are optional. For example, the plot could be generated for the famous iris dataset as follows:

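# define a linear model and isolate coefficients for each variable
# (the predictors are listed in the same order as the heatmap columns,
# so that each coefficient lines up with its column)
iris.coef <- coef(lm(Petal.Length ~ Sepal.Width + Sepal.Length + Petal.Width, data = iris))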

# generate the plot:
superheat(X = iris[,c("Sepal.Width","Sepal.Length","Petal.Width")], # heatmap matrix
            yr = iris[,"Petal.Length"], #plot Petal.Length to the right
            yr.axis.name = "Petal.Length",
            membership.rows = iris[,"Species"], # cluster the rows by species
            yt = iris.coef[-1], # plot the model coefficients above
            yt.plot.type = "bar", # make the plot above a barplot
            yt.axis.name = "coefficient")

This plot shows a heatmap of the design matrix, \(X\), which consists of 150 rows (the first 10 of which are shown below) and 3 variables (the first three columns of the table below). We also have a species variable (the fourth column of the table below) that we are using as our cluster membership identifier. Finally, on the right-hand side of the plot we are plotting a fourth variable, Petal.Length. We can see from this plot that the virginica species has longer petals and sepals than setosa, which has short sepals and particularly small petal widths.

Sepal.Width Sepal.Length Petal.Width Species
3.5 5.1 0.2 setosa
3 4.9 0.2 setosa
3.2 4.7 0.2 setosa
3.1 4.6 0.2 setosa
3.6 5 0.2 setosa
3.9 5.4 0.4 setosa
3.4 4.6 0.3 setosa
3.4 5 0.2 setosa
2.9 4.4 0.2 setosa
3.1 4.9 0.1 setosa
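Before plotting, it can help to inspect the inputs in base R. The sketch below rebuilds the objects passed to superheat above; the names X, fit and coefs are introduced here for illustration only. Indexing the coefficients by column name drops the intercept and puts them in the same order as the heatmap columns.

```r
# Heatmap matrix: three of the four numeric iris measurements
X <- iris[, c("Sepal.Width", "Sepal.Length", "Petal.Width")]
dim(X)        # number of rows and columns
head(X, 10)   # the first 10 rows, as in the table above

# Fit the linear model and keep one coefficient per heatmap column,
# selected by name so that the order matches the columns of X
fit <- lm(Petal.Length ~ Sepal.Length + Sepal.Width + Petal.Width, data = iris)
coefs <- coef(fit)[colnames(X)]
coefs

# yt needs one value per column of X; yr needs one value per row
length(coefs) == ncol(X)
length(iris[, "Petal.Length"]) == nrow(X)
```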

Further, if you are one of those people who prefer their design matrix to be \(p \times n\) instead of \(n \times p\), we can accommodate you too! For example, below we plot a heatmap of the transpose of \(X\), with the Petal.Length variable above it (yt corresponds to the variable to be plotted on top of the heatmap and yr corresponds to the variable to be plotted to the right of the heatmap). Note that below we must specify cluster.rows = FALSE, since the default is to cluster the rows but not the columns.

superheat(X = t(iris[,c("Sepal.Width","Sepal.Length","Petal.Width")]), 
            yt = iris[,"Petal.Length"], 
            yt.axis.name = "Petal.Length",
            membership.cols = iris[,"Species"], 
            cluster.rows = FALSE,
            yr = iris.coef[-1], # plot the model coefficients to the right
            yr.plot.type = "bar", # make the plot to the right a barplot
            yr.axis.name = "coefficient",
            bottom.label.pal = "white")
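When the matrix is transposed, the roles of the row and column vectors swap. The following base R check is a small sketch of that bookkeeping; the name Xt is introduced here for illustration.

```r
X <- iris[, c("Sepal.Width", "Sepal.Length", "Petal.Width")]

# t() applied to a data frame returns a matrix
Xt <- t(X)
dim(Xt)        # variables in rows, observations in columns
rownames(Xt)   # the variable names are now row names

# Vectors tied to observations must now match the columns of Xt
length(iris[, "Petal.Length"]) == ncol(Xt)   # used as yt
length(iris[, "Species"]) == ncol(Xt)        # used as membership.cols
```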


Customizing the plot

Below we continue with the iris example and show how to use the numerous options for customizing the plot

Clustering options (row and column)

In the iris example above, we have specified our own cluster vector (the Species variable, which consists of three classes: setosa, versicolor and virginica). However, if you do not have a cluster membership vector in mind (or cannot be bothered to generate one using your clustering algorithm of preference), never fear! The superheat function has built-in clustering options for performing K-means (the default if a membership vector is not supplied) and hierarchical clustering (specify clustering.method = "hierarchical") on the rows and/or columns.

set.seed(134)
superheat(X = iris[,c("Sepal.Width","Sepal.Length","Petal.Width")], 
            yr = iris[,"Petal.Length"], 
            yr.axis.name = "Petal.Length",
            n.clusters.rows = 3,
            clustering.method = "kmeans")
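If you would rather generate the membership vector yourself, any clustering routine that returns one label per row will do. The sketch below uses base R's hclust and cutree; the linkage and distance (complete linkage on Euclidean distances, the base R defaults) are assumptions made here and need not match what superheat uses internally. The name my.membership is illustrative.

```r
X <- iris[, c("Sepal.Width", "Sepal.Length", "Petal.Width")]

# Hierarchical clustering on Euclidean distances, cut into 3 groups
# (assumption: hclust's default complete linkage)
tree <- hclust(dist(X))
my.membership <- cutree(tree, k = 3)
table(my.membership)   # cluster sizes

# Pass the vector to superheat in place of the Species labels
if (requireNamespace("superheat", quietly = TRUE)) {
  superheat::superheat(X = X,
                       yr = iris[, "Petal.Length"],
                       yr.axis.name = "Petal.Length",
                       membership.rows = my.membership)
}
```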

Again we note that, if no membership vectors are supplied, the default is to cluster the rows but not the columns. As a result, the number of row clusters must be supplied (above we have n.clusters.rows = 3), otherwise an error will be produced. If you wish to cluster the columns as well, you can specify the number of column clusters using the n.clusters.col argument.

If you want to keep the variable names as the bottom labels, you can simply specify bottom.heat.label = "variable".

set.seed(134)
superheat(X = iris[,c("Sepal.Width","Sepal.Length","Petal.Width")], 
            yr = iris[,"Petal.Length"], 
            yr.axis.name = "Petal.Length",
            n.clusters.rows = 3,
            n.clusters.col = 2,
            bottom.heat.label = "variable")

Moreover, if you have used the built-in clustering algorithms, you can obtain the cluster membership vector by saving the superheat object and accessing the membership.rows element (below we hide the plot itself, which is the same as the previous plot, by specifying print.plot = FALSE).

set.seed(134)
plot <- superheat(X = iris[,c("Sepal.Width","Sepal.Length","Petal.Width")], 
            yr = iris[,"Petal.Length"], 
            yr.axis.name = "Petal.Length",
            n.clusters.rows = 3,
            print.plot = FALSE)
plot$membership.rows
#>   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18 
#>   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
#>  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36 
#>   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
#>  37  38  39  40  41  43  44  45  46  47  48  49  50  42  51  52  53  55 
#>   1   1   1   1   1   1   1   1   1   1   1   1   1   2   2   2   2   2 
#>  56  57  58  59  60  61  62  64  65  66  67  68  70  71  72  74  75  76 
#>   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2 
#>  77  78  79  80  81  82  83  84  85  86  87  89  90  91  92  93  94  95 
#>   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2 
#>  96  97  98  99 100 104 107 111 117 118 125 126 128 130 132 134 135 137 
#>   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2 
#> 138 139 149 150  54  63  69  73  88 101 102 103 105 106 108 109 110 112 
#>   2   2   2   2   3   3   3   3   3   3   3   3   3   3   3   3   3   3 
#> 113 114 115 116 119 120 121 122 123 124 127 129 131 133 136 140 141 142 
#>   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3 
#> 143 144 145 146 147 148 
#>   3   3   3   3   3   3

We note that the k-means clustering corresponds mostly to the Species variable.
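One way to quantify this correspondence is a cross-tabulation of cluster labels against species. The sketch below uses stats::kmeans as a stand-in for the membership vector returned by superheat; that substitution is an assumption, and the cluster numbering (and possibly the assignments) may differ from the output shown above.

```r
X <- iris[, c("Sepal.Width", "Sepal.Length", "Petal.Width")]

set.seed(134)
# Stand-in for plot$membership.rows (assumption: base R k-means)
km <- kmeans(X, centers = 3)

# Rows: cluster number; columns: species
agreement <- table(cluster = km$cluster, species = iris$Species)
agreement

# Share of observations that fall in the largest cluster for their species
sum(apply(agreement, 2, max)) / nrow(X)
```

If you have the saved superheat object, replacing km$cluster with plot$membership.rows (reordered to match the rows of iris by its names) gives the same kind of table.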

Plot types

The top/right plots can be one of several types. The default is a scatterplot. The following other options are also available:

  • A scatterplot with loess smoothed curve
superheat(X = iris[,c("Sepal.Width","Sepal.Length","Petal.Width")], 
            yr = iris[,"Petal.Length"],
            membership.rows = iris[,"Species"],
            yr.axis.name = "Petal.Length",
            yr.plot.type = "scattersmooth")

  • A scatterplot with fitted line
superheat(X = iris[,c("Sepal.Width","Sepal.Length","Petal.Width")], 
            yr = iris[,"Petal.Length"],
            membership.rows = iris[,"Species"],
            yr.axis.name = "Petal.Length",
            yr.plot.type = "scattersmooth",
            smoothing.method = "lm")

  • A smoothed curve
superheat(X = iris[,c("Sepal.Width","Sepal.Length","Petal.Width")], 
            yr = iris[,"Petal.Length"],
            membership.rows = iris[,"Species"],
            yr.axis.name = "Petal.Length",
            yr.plot.type = "smooth")

  • A line
superheat(X = iris[,c("Sepal.Width","Sepal.Length","Petal.Width")], 
            yr = iris[,"Petal.Length"],
            membership.rows = iris[,"Species"],
            yr.axis.name = "Petal.Length",
            yr.plot.type = "line")

  • A line with points
superheat(X = iris[,c("Sepal.Width","Sepal.Length","Petal.Width")], 
            yr = iris[,"Petal.Length"],
            membership.rows = iris[,"Species"],
            yr.axis.name = "Petal.Length",
            yr.plot.type = "scatterline")

  • A barplot (this typically makes more sense for variables)
superheat(X = iris[,c("Sepal.Width","Sepal.Length","Petal.Width")], 
            yr = iris[,"Petal.Length"],
            membership.rows = iris[,"Species"],
            yr.axis.name = "Petal.Length",
            yr.plot.type = "bar",
            yr.bar.col = "grey")

  • A boxplot
superheat(X = iris[,c("Sepal.Width","Sepal.Length","Petal.Width")], 
            yr = iris[,"Petal.Length"],
            membership.rows = iris[,"Species"],
            yr.axis.name = "Petal.Length",
            yr.plot.type = "boxplot")

Layout

It is not necessary to have all components in your plot at all times. For example, if you do not supply a yt vector, no top plot is drawn; similarly, if you do not supply a yr vector, no right plot is drawn. Further options include:

  • You can remove the legend:
superheat(X = iris[,c("Sepal.Width","Sepal.Length","Petal.Width")], 
            yr = iris[,"Petal.Length"], 
            yr.axis.name = "Petal.Length",
            membership.rows = iris[,"Species"],
            legend = FALSE)

  • You can remove the left cluster/variable labels
superheat(X = iris[,c("Sepal.Width","Sepal.Length","Petal.Width")], 
            yr = iris[,"Petal.Length"], 
            yr.axis.name = "Petal.Length",
            membership.rows = iris[,"Species"],
            left.heat.label = "none")

  • You can remove the bottom cluster/variable labels
superheat(X = iris[,c("Sepal.Width","Sepal.Length","Petal.Width")], 
            yr = iris[,"Petal.Length"], 
            yr.axis.name = "Petal.Length",
            membership.rows = iris[,"Species"],
            bottom.heat.label = "none")

  • You can remove the scatterplot axis (similarly for the top axis)
superheat(X = iris[,c("Sepal.Width","Sepal.Length","Petal.Width")], 
            yr = iris[,"Petal.Length"],
            membership.rows = iris[,"Species"],
            yr.axis = FALSE)

  • You can add a title
superheat(X = iris[,c("Sepal.Width","Sepal.Length","Petal.Width")], 
            yr = iris[,"Petal.Length"],
            membership.rows = iris[,"Species"],
            yr.axis.name = "Petal.Length",
            title = "I'm a title!")

  • You can add white padding (in cm) around the plot (the default is 2cm):
superheat(X = iris[,c("Sepal.Width","Sepal.Length","Petal.Width")], 
            yr = iris[,"Petal.Length"],
            membership.rows = iris[,"Species"],
            yr.axis.name = "Petal.Length",
            padding = 2.5)

  • You can remove the cluster boxes:
superheat(X = iris[,c("Sepal.Width","Sepal.Length","Petal.Width")], 
            yr = iris[,"Petal.Length"],
            membership.rows = iris[,"Species"],
            yr.axis.name = "Petal.Length",
            cluster.box = FALSE)

Colour palettes

The default colour palettes are as shown in the above examples; however, the user can specify their own:

  • There are a number of optional colour schemes for the heatmap including “red” (the default), “purple”, “blue” (shown below), “grey” and “green”. These can be specified using the heat.col.scheme argument.
superheat(X = iris[,c("Sepal.Width","Sepal.Length","Petal.Width")], 
            yr = iris[,"Petal.Length"],
            membership.rows = iris[,"Species"],
            yr.axis.name = "Petal.Length",
            heat.col.scheme = "blue")

  • You can also specify your own colour scheme for the heatmap using the heat.pal argument. You can specify as many colours as you want.
superheat(X = iris[,c("Sepal.Width","Sepal.Length","Petal.Width")], 
            yr = iris[,"Petal.Length"],
            membership.rows = iris[,"Species"],
            yr.axis.name = "Petal.Length",
            heat.pal = c("yellow", "brown", "blue")) 

  • If your data are centered around 0 (e.g. standardized values), you may want to use a diverging palette. You can do this using the heat.pal argument and selecting a diverging palette (e.g. the Red/Blue palette from the RColorBrewer package).
library(RColorBrewer)
scaled_iris <- scale(iris[,c("Sepal.Width","Sepal.Length","Petal.Width")])
superheat(X = scaled_iris, 
            yr = iris[,"Petal.Length"],
            membership.rows = iris[,"Species"],
            yr.axis.name = "Petal.Length",
            heat.pal = brewer.pal(7, "RdBu"))

  • To specify the values at which the color changes occur, you can use the heat.pal.values argument. The length of heat.pal.values must be the same as the length of the heat.pal argument.
library(RColorBrewer)
scaled_iris <- scale(iris[,c("Sepal.Width","Sepal.Length","Petal.Width")])
superheat(X = scaled_iris, 
            yr = iris[,"Petal.Length"],
            membership.rows = iris[,"Species"],
            yr.axis.name = "Petal.Length",
            heat.pal = brewer.pal(7, "RdBu"),
            heat.pal.values = c(0, 0.2, 0.25, 0.3, 0.7, 0.8, 1))

  • You can specify the colour of the points in the scatterplot (replace yr with yt for the top scatterplot). Note that the order of the colours in the yr.obs.col vector corresponds to the order of the points in yr.
col.vec <- rep("black", 150)
col.vec[50:70] <- "red"

superheat(X = iris[,c("Sepal.Width","Sepal.Length","Petal.Width")], 
            yr = iris[,"Petal.Length"],
            membership.rows = iris[,"Species"],
            yr.axis.name = "Petal.Length",
            yr.obs.col = col.vec)

  • Or you can simply specify the colour for each cluster of yr.
superheat(X = iris[,c("Sepal.Width","Sepal.Length","Petal.Width")], 
            yr = iris[,"Petal.Length"],
            membership.rows = iris[,"Species"],
            yr.axis.name = "Petal.Length",
            yr.pal = c("red","blue","orange"))

  • You can specify the cluster label (and variable label) colors. (The equivalent for the bottom labels can be specified using bottom.label.pal and bottom.label.text.col)
superheat(X = iris[,c("Sepal.Width","Sepal.Length","Petal.Width")], 
            yr = iris[,"Petal.Length"],
            membership.rows = iris[,"Species"],
            yr.axis.name = "Petal.Length",
            left.label.pal = c("blue","red","purple"),
            left.label.text.col = c("white","black","white"))

  • You can specify the colour of the cluster boxes
superheat(X = iris[,c("Sepal.Width","Sepal.Length","Petal.Width")], 
            yr = iris[,"Petal.Length"],
            membership.rows = iris[,"Species"],
            yr.axis.name = "Petal.Length",
            box.col = "white")
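Several of the palette inputs above can be sanity-checked in base R before plotting. The following is a minimal sketch: colorRampPalette is used as a base R stand-in for brewer.pal so that no extra package is needed, and the Petal.Length > 5 rule for colouring points is an arbitrary example, not something taken from the package.

```r
X <- iris[, c("Sepal.Width", "Sepal.Length", "Petal.Width")]

# scale() centres each column at 0 with unit standard deviation,
# which is why a diverging palette suits the scaled matrix
scaled_iris <- scale(X)
round(colMeans(scaled_iris), 10)
apply(scaled_iris, 2, sd)

# heat.pal and heat.pal.values must have the same length
pal <- colorRampPalette(c("red", "white", "blue"))(7)
pal.values <- c(0, 0.2, 0.25, 0.3, 0.7, 0.8, 1)
stopifnot(length(pal) == length(pal.values))

# One colour per observation, in the same order as yr
col.vec <- ifelse(iris$Petal.Length > 5, "red", "black")
stopifnot(length(col.vec) == nrow(X))
table(col.vec)
```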

Heatmap smoothing

  • You can smooth the colors within each cluster of the heatmap
superheat(X = iris[,c("Sepal.Width","Sepal.Length","Petal.Width")], 
            yr = iris[,"Petal.Length"],
            membership.rows = iris[,"Species"],
            yr.axis.name = "Petal.Length",
            smooth.heat = TRUE)

Sizing

There are a number of options to change the text size, point size and legend size in the plot. For example, you can specify the following arguments in the superheat function:

  • You can change the legend size
superheat(X = iris[,c("Sepal.Width","Sepal.Length","Petal.Width")], 
            yr = iris[,"Petal.Length"],
            membership.rows = iris[,"Species"],
            yr.axis.name = "Petal.Length",
            legend.size = 4)

  • You can change the cluster/variable label size (the default is 0.1) and the label text angle
superheat(X = iris[,c("Sepal.Width","Sepal.Length","Petal.Width")], 
            yr = iris[,"Petal.Length"],
            membership.rows = iris[,"Species"],
            yr.axis.name = "Petal.Length",
            bottom.label.size = 0.5,
            bottom.text.angle = 90,
            left.label.size = 0.4,
            left.text.angle = 0)

  • You can change the cluster/variable label text size (the default is 5)
superheat(X = iris[,c("Sepal.Width","Sepal.Length","Petal.Width")], 
            yr = iris[,"Petal.Length"],
            membership.rows = iris[,"Species"],
            yr.axis.name = "Petal.Length",
            left.text.size = 7,
            bottom.text.size = 2)

  • You can change the scatterplot axis size and the axis label size (the default for both is 10).
superheat(X = iris[,c("Sepal.Width","Sepal.Length","Petal.Width")], 
            yr = iris[,"Petal.Length"],
            membership.rows = iris[,"Species"],
            yr.axis.name = "Petal.Length",
            yr.axis.size = 15,
            yr.axis.name.angle = 10,
            yr.axis.name.size = 20)

  • You can change the point size
superheat(X = iris[,c("Sepal.Width","Sepal.Length","Petal.Width")], 
            yr = iris[,"Petal.Length"],
            membership.rows = iris[,"Species"],
            yr.axis.name = "Petal.Length",
            yr.point.size = 3.5)

  • You can change the thickness of the cluster boxes
superheat(X = iris[,c("Sepal.Width","Sepal.Length","Petal.Width")], 
            yr = iris[,"Petal.Length"],
            membership.rows = iris[,"Species"],
            yr.axis.name = "Petal.Length",
            box.size = 2)

Ordering

  • You can change the order of observations within each cluster. For example, we can order the observations by Sepal.Width
superheat(X = iris[,c("Sepal.Width","Sepal.Length","Petal.Width")], 
            yr = iris[,"Petal.Length"],
            membership.rows = iris[,"Species"],
            yr.axis.name = "Petal.Length",
            order.rows = order(iris$Sepal.Width))
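The value passed to order.rows above is simply the output of base R's order(). As a quick sketch of what that vector contains (the names ord and ord2 are illustrative):

```r
# order() returns the row indices that would sort the vector
ord <- order(iris$Sepal.Width)
head(ord)                      # indices of the rows with the smallest widths
head(iris$Sepal.Width[ord])    # the widths themselves, in increasing order

# The result is a permutation with one index per row of the data
length(ord) == nrow(iris)

# Ties can be broken with a second key, here Sepal.Length
ord2 <- order(iris$Sepal.Width, iris$Sepal.Length)
```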

Saving the images

The best format for saving superheat images is a .png file. The easiest way to do this in R is to invoke the png() function (remember to call dev.off() when you’re done!)

png("superheat.png", height = 500, width = 800)
superheat(X = iris[,c("Sepal.Width","Sepal.Length","Petal.Width")], 
            yr = iris[,"Petal.Length"],
            membership.rows = iris[,"Species"],
            yr.axis.name = "Petal.Length",
            order.rows = order(iris$Sepal.Width))
dev.off()
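Since forgetting dev.off() leaves the file open, the pattern above can be wrapped so that the device is always closed. This is a minimal sketch: save_png is a hypothetical helper introduced here (not part of superheat), and a base R plot is used in the example so that it runs without the package.

```r
# Hypothetical helper: open a png device, draw, and always close the device
save_png <- function(file, draw, height = 500, width = 800) {
  png(file, height = height, width = width)
  on.exit(dev.off(), add = TRUE)   # runs even if draw() fails
  draw()
  invisible(file)
}

# Example with a base R plot; pass a function that calls superheat(...)
# in its place to save a superheat image instead
if (capabilities("png")) {
  out <- save_png(tempfile(fileext = ".png"),
                  function() plot(iris$Sepal.Width))
  file.exists(out)
}
```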

The end

Thanks for using the package! I hope you find it helpful in your data exploration adventures. The development page can be found at https://github.com/rlbarter/superheat. For pull requests and suggestions please follow the standard protocol. For pressing questions or comments feel free to email me at rebeccabarter@berkeley.edu.